18 research outputs found

    Transformer for Emotion Recognition

    This paper describes the UMONS solution for the OMG-Emotion Challenge. We explore a context-dependent architecture where the arousal and valence of an utterance are predicted according to its surrounding context (i.e. the preceding and following utterances of the video). We report an improvement when taking context into account for both unimodal and multimodal predictions.
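
    A minimal sketch of what such a context-dependent predictor could look like, assuming each utterance is already encoded as a fixed-size multimodal feature vector; the module, the dimensions and the bidirectional GRU over (previous, current, next) are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch (not the paper's code): predict arousal/valence of an
    # utterance from its own embedding plus the preceding and following ones.
    import torch
    import torch.nn as nn

    class ContextualAffectRegressor(nn.Module):
        def __init__(self, feat_dim=512, hidden_dim=256):
            super().__init__()
            # Bidirectional GRU reads the (previous, current, next) sequence.
            self.context_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                                      bidirectional=True)
            # Regression head outputs arousal and valence in [-1, 1].
            self.head = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                      nn.ReLU(),
                                      nn.Linear(hidden_dim, 2),
                                      nn.Tanh())

        def forward(self, prev_utt, cur_utt, next_utt):
            # Each input: (batch, feat_dim); stack into a length-3 sequence.
            seq = torch.stack([prev_utt, cur_utt, next_utt], dim=1)
            out, _ = self.context_rnn(seq)          # (batch, 3, 2*hidden_dim)
            return self.head(out[:, 1])             # state of the middle (current) utterance

    # Toy usage with random vectors standing in for real utterance encodings.
    x = [torch.randn(4, 512) for _ in range(3)]
    print(ContextualAffectRegressor()(*x).shape)    # torch.Size([4, 2])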

    Modulating and attending the source image during encoding improves Multimodal Translation

    We propose a new and fully end-to-end approach for multimodal translation where the source text encoder modulates the entire visual input processing using conditional batch normalization, in order to compute the most informative image features for our task. Additionally, we propose a new attention mechanism derived from this original idea, where the attention model for the visual input is conditioned on the source text encoder representations. In the paper, we detail our models as well as the image analysis pipeline. Finally, we report experimental results. They are, as far as we know, the new state of the art on three different test sets. Comment: Accepted at NIPS Workshop
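
    A minimal sketch of the conditional-batch-normalization idea described above, assuming the source sentence has already been pooled into a single vector; layer sizes and names are assumptions, not the paper's implementation.

    # Sketch of conditional batch normalization: the source-text encoding predicts
    # per-channel scales/shifts that modulate the image feature maps.
    import torch
    import torch.nn as nn

    class ConditionalBatchNorm2d(nn.Module):
        def __init__(self, num_channels, text_dim):
            super().__init__()
            # Plain BN without its own affine parameters; they come from the text.
            self.bn = nn.BatchNorm2d(num_channels, affine=False)
            self.to_gamma = nn.Linear(text_dim, num_channels)
            self.to_beta = nn.Linear(text_dim, num_channels)

        def forward(self, feat_maps, text_state):
            # feat_maps: (B, C, H, W); text_state: (B, text_dim), e.g. the
            # text encoder's final hidden state.
            normed = self.bn(feat_maps)
            gamma = 1.0 + self.to_gamma(text_state).unsqueeze(-1).unsqueeze(-1)
            beta = self.to_beta(text_state).unsqueeze(-1).unsqueeze(-1)
            return gamma * normed + beta

    # Toy usage: 14x14 convolutional maps modulated by a 512-d sentence encoding.
    cbn = ConditionalBatchNorm2d(num_channels=256, text_dim=512)
    print(cbn(torch.randn(2, 256, 14, 14), torch.randn(2, 512)).shape)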

    Object-oriented Targets for Visual Navigation using Rich Semantic Representations

    When searching for an object, humans navigate through a scene using semantic information and spatial relationships. We look for an object using our knowledge of its attributes and relationships with other objects to infer its probable location. In this paper, we propose to tackle the visual navigation problem using rich semantic representations of the observed scene and object-oriented targets to train an agent. We show that both allow the agent to generalize to new targets and unseen scenes in a short amount of training time. Comment: Presented at NIPS workshop (ViGIL)
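
    The abstract does not spell out the agent's exact inputs; the sketch below only illustrates the general idea of feeding a policy both a semantic representation of the observation and an embedding of the object-oriented target. All names and sizes are assumptions, not the authors' agent.

    # Illustrative policy network: semantic scene features + target-object embedding.
    import torch
    import torch.nn as nn

    NUM_OBJECT_CLASSES = 100   # assumed size of the target vocabulary
    NUM_ACTIONS = 4            # e.g. forward, turn left, turn right, stop

    class ObjectTargetPolicy(nn.Module):
        def __init__(self, sem_dim=512, target_dim=64, hidden=256):
            super().__init__()
            self.target_embed = nn.Embedding(NUM_OBJECT_CLASSES, target_dim)
            self.policy = nn.Sequential(
                nn.Linear(sem_dim + target_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, NUM_ACTIONS))

        def forward(self, semantic_obs, target_id):
            # semantic_obs: (B, sem_dim) pooled semantic features of the scene
            # target_id: (B,) index of the object class the agent must reach
            joint = torch.cat([semantic_obs, self.target_embed(target_id)], dim=-1)
            return self.policy(joint)   # unnormalized action scores

    scores = ObjectTargetPolicy()(torch.randn(2, 512), torch.tensor([3, 17]))
    print(scores.shape)  # torch.Size([2, 4])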

    Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation

    In state-of-the-art Neural Machine Translation, an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, where it becomes possible to focus both on sentence parts and image regions. Approaches to pool two modalities usually include element-wise product, sum or concatenation. In this paper, we evaluate the more advanced Multimodal Compact Bilinear pooling method, which takes the outer product of two vectors to combine the attention features for the two modalities. This has been previously investigated for visual question answering. We try out this approach for multimodal image caption translation and show improvements compared to basic combination methods. Comment: Submitted to ICLR Workshop 2017
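
    For reference, a compact sketch of the Multimodal Compact Bilinear operation itself (as introduced by Fukui et al. for visual question answering): each modality vector is count-sketched into a common dimension and the outer product is approximated by an element-wise product in the frequency domain. Dimensions and the fixed random seed are illustrative assumptions.

    # Minimal MCB pooling sketch; not the paper's exact pipeline.
    import torch

    def count_sketch(x, hashes, signs, out_dim):
        # x: (B, D); scatter each signed input coordinate into its hashed bucket.
        sketch = x.new_zeros(x.size(0), out_dim)
        sketch.index_add_(1, hashes, x * signs)
        return sketch

    def mcb_pool(text_feat, img_feat, out_dim=8000, seed=0):
        g = torch.Generator().manual_seed(seed)
        def random_maps(dim):
            h = torch.randint(0, out_dim, (dim,), generator=g)
            s = torch.randint(0, 2, (dim,), generator=g).float() * 2 - 1
            return h, s
        h1, s1 = random_maps(text_feat.size(1))
        h2, s2 = random_maps(img_feat.size(1))
        fft1 = torch.fft.rfft(count_sketch(text_feat, h1, s1, out_dim))
        fft2 = torch.fft.rfft(count_sketch(img_feat, h2, s2, out_dim))
        # Element-wise product in the frequency domain approximates the
        # circular convolution of the two sketches (i.e. the outer product).
        return torch.fft.irfft(fft1 * fft2, n=out_dim)

    fused = mcb_pool(torch.randn(2, 1024), torch.randn(2, 2048))
    print(fused.shape)  # torch.Size([2, 8000])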

    An empirical study on the effectiveness of images in Multimodal Neural Machine Translation

    In state-of-the-art Neural Machine Translation (NMT), an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, where it becomes possible to focus both on sentence parts and the image regions they describe. In this paper, we compare several attention mechanisms on the multimodal translation task (English and image to German) and evaluate the ability of the model to make use of images to improve translation. We surpass state-of-the-art scores on the Multi30k data set; we nevertheless identify and report several kinds of misbehavior of the model while translating. Comment: Accepted to EMNLP 2017
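
    As background, here is a generic soft attention over image regions conditioned on the decoder state, i.e. the kind of mechanism such a comparison covers; this is an illustration with assumed dimensions, not the specific variants evaluated in the paper.

    # Additive (Bahdanau-style) attention over CNN region features.
    import torch
    import torch.nn as nn

    class VisualAttention(nn.Module):
        def __init__(self, dec_dim=512, img_dim=2048, att_dim=256):
            super().__init__()
            self.proj_dec = nn.Linear(dec_dim, att_dim)
            self.proj_img = nn.Linear(img_dim, att_dim)
            self.score = nn.Linear(att_dim, 1)

        def forward(self, dec_state, img_regions):
            # dec_state: (B, dec_dim); img_regions: (B, R, img_dim), R = e.g. 14*14
            energy = self.score(torch.tanh(
                self.proj_dec(dec_state).unsqueeze(1) + self.proj_img(img_regions)))
            alpha = torch.softmax(energy, dim=1)           # (B, R, 1) region weights
            context = (alpha * img_regions).sum(dim=1)     # (B, img_dim) visual context
            return context, alpha.squeeze(-1)

    ctx, weights = VisualAttention()(torch.randn(2, 512), torch.randn(2, 196, 2048))
    print(ctx.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 196])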

    Bringing back simplicity and lightliness into neural image captioning

    Neural Image Captioning (NIC), or neural caption generation, has attracted a lot of attention over the last few years. Describing an image in natural language has been an emerging challenge in both computer vision and language processing. Therefore, a lot of research has focused on driving this task forward with new creative ideas. So far, the goal has been to maximize scores on automated metrics, and to do so, one has to come up with a plurality of new modules and techniques. Once these add up, the models become complex and resource-hungry. In this paper, we take a small step backwards in order to study an architecture with an interesting trade-off between performance and computational complexity. To do so, we tackle every component of a neural captioning model and propose one or more solutions that lighten the model overall. Our ideas are inspired by two related tasks: Multimodal and Monomodal Neural Machine Translation.

    Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation

    In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source description in English. This is considered the multimodal image caption translation task. The images are processed with a Convolutional Neural Network (CNN) to extract visual features exploitable by the translation model. So far, the CNNs used are pre-trained on object detection and localization tasks. We hypothesize that richer architectures, such as dense captioning models, may be more suitable for MNMT and could lead to improved translations. We extend this intuition to the word embeddings, where we compute both a linguistic and a visual representation for our corpus vocabulary. We combine and compare different configurations. Comment: Accepted to GLU 2017. arXiv admin note: text overlap with arXiv:1707.0099
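
    A hedged illustration of what a "visually grounded" embedding table could look like: every word keeps a linguistic vector and, when available, a visual vector (e.g. averaged CNN features of images associated with that word), and the two are concatenated at lookup time. The construction below is an assumption for illustration, not the paper's recipe.

    # Word embeddings augmented with per-word visual vectors.
    import torch
    import torch.nn as nn

    class GroundedEmbedding(nn.Module):
        def __init__(self, vocab_size, ling_dim=300, vis_dim=512):
            super().__init__()
            self.linguistic = nn.Embedding(vocab_size, ling_dim)
            # Pre-computed visual vectors per word; zero for words never seen in images.
            self.register_buffer("visual", torch.zeros(vocab_size, vis_dim))

        def load_visual_vectors(self, word_ids, vectors):
            # word_ids: (N,) vocabulary indices; vectors: (N, vis_dim) averaged image features.
            self.visual[word_ids] = vectors

        def forward(self, token_ids):
            return torch.cat([self.linguistic(token_ids), self.visual[token_ids]], dim=-1)

    emb = GroundedEmbedding(vocab_size=10000)
    emb.load_visual_vectors(torch.tensor([5, 42]), torch.randn(2, 512))
    print(emb(torch.tensor([[5, 42, 7]])).shape)  # torch.Size([1, 3, 812])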

    Adversarial reconstruction for Multi-modal Machine Translation

    Even with the growing interest in problems at the intersection of Computer Vision and Natural Language, grounding (i.e. identifying) the components of a structured description in an image still remains a challenging task. This contribution proposes a model which learns grounding by reconstructing the visual features for the Multi-modal translation task. Previous works have partially investigated standard approaches, such as regression methods, to approximate the reconstruction of a visual input. In this paper, we propose a different and novel approach which learns grounding by adversarial feedback. To do so, we modulate our network following recent promising adversarial architectures and evaluate how the adversarial response from a visual reconstruction, used as an auxiliary task, helps the model in its learning. We report the highest scores in terms of the BLEU and METEOR metrics on the different datasets.
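
    A schematic of such an adversarial auxiliary objective, under assumed shapes and names (not the paper's implementation): a generator reconstructs the image feature from the translation model's hidden state, and a discriminator judges real versus reconstructed features.

    # Adversarial visual-reconstruction losses to be added alongside the translation loss.
    import torch
    import torch.nn as nn

    hidden_dim, vis_dim = 512, 2048
    generator = nn.Sequential(nn.Linear(hidden_dim, 1024), nn.ReLU(),
                              nn.Linear(1024, vis_dim))
    discriminator = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU(),
                                  nn.Linear(512, 1))
    bce = nn.BCEWithLogitsLoss()

    def adversarial_losses(decoder_state, real_vis_feat):
        fake_vis_feat = generator(decoder_state)
        # Discriminator: real features -> 1, reconstructed features -> 0.
        d_loss = bce(discriminator(real_vis_feat), torch.ones(real_vis_feat.size(0), 1)) + \
                 bce(discriminator(fake_vis_feat.detach()), torch.zeros(real_vis_feat.size(0), 1))
        # Generator (auxiliary term for the translation model): fool the discriminator.
        g_loss = bce(discriminator(fake_vis_feat), torch.ones(real_vis_feat.size(0), 1))
        return d_loss, g_loss

    d_loss, g_loss = adversarial_losses(torch.randn(4, hidden_dim), torch.randn(4, vis_dim))
    print(d_loss.item(), g_loss.item())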

    Can adversarial training learn image captioning?

    Recently, generative adversarial networks (GANs) have gathered a lot of interest. Their efficiency in generating unseen samples of high quality, especially images, has improved over the years. In the field of Natural Language Generation (NLG), using the adversarial setting to generate meaningful sentences has proven difficult for two reasons: the lack of existing architectures that produce realistic sentences and the lack of evaluation tools. In this paper, we propose an adversarial architecture related to the conditional GAN (cGAN) that generates sentences according to a given image (also called image captioning). This attempt is the first that uses no pre-training or reinforcement methods. We also explain why our experimental settings can be safely evaluated and interpreted for future work. Comment: Accepted to NeurIPS 2019 ViGiL workshop
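
    A rough sketch of one way a conditional-GAN captioner can be trained without pre-training or reinforcement learning: the generator emits soft word distributions via a Gumbel-softmax, so gradients flow back through the discriminator. The Gumbel-softmax choice and all sizes here are my assumptions for illustration, not necessarily the authors' architecture.

    # Toy cGAN captioning sketch with differentiable word "sampling".
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, IMG_DIM, HID, MAX_LEN = 5000, 2048, 512, 16

    class Generator(nn.Module):
        def __init__(self):
            super().__init__()
            self.img_to_h = nn.Linear(IMG_DIM, HID)
            self.cell = nn.GRUCell(VOCAB, HID)
            self.out = nn.Linear(HID, VOCAB)

        def forward(self, img_feat, tau=1.0):
            h = torch.tanh(self.img_to_h(img_feat))            # condition on the image
            word = torch.zeros(img_feat.size(0), VOCAB)        # simplification: all-zero <bos> input
            soft_words = []
            for _ in range(MAX_LEN):
                h = self.cell(word, h)
                word = F.gumbel_softmax(self.out(h), tau=tau)  # differentiable word choice
                soft_words.append(word)
            return torch.stack(soft_words, dim=1)              # (B, MAX_LEN, VOCAB)

    class Discriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.word_embed = nn.Linear(VOCAB, 256)            # accepts soft or one-hot words
            self.rnn = nn.GRU(256, HID, batch_first=True)
            self.score = nn.Linear(HID + IMG_DIM, 1)

        def forward(self, soft_caption, img_feat):
            _, h = self.rnn(self.word_embed(soft_caption))
            return self.score(torch.cat([h[-1], img_feat], dim=-1))  # real/fake logit

    g, d = Generator(), Discriminator()
    caption = g(torch.randn(2, IMG_DIM))
    print(caption.shape, d(caption, torch.randn(2, IMG_DIM)).shape)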

    Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition

    This paper presents a new, lightweight yet powerful solution for the task of Emotion Recognition and Sentiment Analysis. Our motivation is to propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state of the art in the field. To demonstrate the efficiency of our models, we carefully evaluate their performances on the IEMOCAP, MOSI, MOSEI and MELD datasets. The experiments can be directly replicated and the code is fully open for future research. Comment: EMNLP 2020 workshop: NLP Beyond Text (NLPBT)
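
    A minimal sketch of the modulation-plus-Transformer idea, under assumed feature shapes and names (not the released code): one modality predicts per-channel scale and shift parameters that modulate the other before a Transformer encoder classifies the fused sequence.

    # Linguistic features modulate acoustic features, then a Transformer classifies.
    import torch
    import torch.nn as nn

    class ModulatedFusionClassifier(nn.Module):
        def __init__(self, ling_dim=768, acou_dim=74, d_model=128, n_classes=7):
            super().__init__()
            self.acou_proj = nn.Linear(acou_dim, d_model)
            # Pooled linguistic representation -> per-channel scale and shift.
            self.to_gamma = nn.Linear(ling_dim, d_model)
            self.to_beta = nn.Linear(ling_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.classifier = nn.Linear(d_model, n_classes)

        def forward(self, ling_seq, acou_seq):
            # ling_seq: (B, Tl, ling_dim) text features; acou_seq: (B, Ta, acou_dim)
            ling_pooled = ling_seq.mean(dim=1)
            gamma = 1.0 + self.to_gamma(ling_pooled).unsqueeze(1)
            beta = self.to_beta(ling_pooled).unsqueeze(1)
            modulated = gamma * self.acou_proj(acou_seq) + beta   # linguistic modulation
            fused = self.encoder(modulated)                       # (B, Ta, d_model)
            return self.classifier(fused.mean(dim=1))             # emotion logits

    model = ModulatedFusionClassifier()
    logits = model(torch.randn(2, 20, 768), torch.randn(2, 150, 74))
    print(logits.shape)  # torch.Size([2, 7])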